This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
However, there are a few things to watch out for in this data set. Customers do not opt into the offers they receive; in other words, a user can receive an offer, never actually view it, and still complete it. For example, a user might receive the "spend 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer, because the customer never viewed it.
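To make that distinction concrete, here is a minimal sketch, using an illustrative event log (not the raw schema), of flagging whether a completion was actually influenced by a view:

```python
import pandas as pd

# Hypothetical events for one user and one offer (times in hours).
# Note there is no 'offer viewed' event here.
events = pd.DataFrame({
    'event': ['offer received', 'transaction', 'offer completed'],
    'time': [0, 120, 120],
})

received_t = events.loc[events.event == 'offer received', 'time'].min()
completed_t = events.loc[events.event == 'offer completed', 'time'].min()
viewed_times = events.loc[events.event == 'offer viewed', 'time']

# The offer only influenced the completion if a view happened
# after reception and no later than the completion.
influenced = bool(((viewed_times >= received_t) &
                   (viewed_times <= completed_t)).any())
print(influenced)  # False: completed without ever viewing
```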
This makes data cleaning especially important and tricky.
You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or you don't need to build a machine learning model at all: you could develop a set of heuristics that determine which offer to send to each customer (e.g., if 75 percent of 35-year-old women responded to offer A versus 40 percent from the same demographic for offer B, send offer A).
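A minimal sketch of that heuristic approach, using a made-up toy table of responses (the column names are illustrative, not the raw schema):

```python
import pandas as pd

# Toy responses: 1 means the customer responded to the offer.
responses = pd.DataFrame({
    'gender':  ['F', 'F', 'F', 'M', 'M', 'M'],
    'offer':   ['A', 'A', 'B', 'A', 'B', 'B'],
    'success': [1, 1, 0, 0, 1, 0],
})

# Response rate per (demographic, offer) cell.
rates = responses.groupby(['gender', 'offer']).success.mean()

# For each demographic, pick the offer with the highest observed rate.
best_offer = rates.unstack().idxmax(axis=1)
print(best_offer)  # F -> A, M -> B
```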
The data is contained in three files:
Here is the schema and explanation of each variable in the files:
portfolio.json
profile.json
transcript.json
Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.
You can see how to access the terminal and how the install works using the two images below. First you need to access the terminal:

Then you will want to run the above command:

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.
Running the entire notebook for the first time (without grid search) takes about 1.5 hours. Searching for hyperparameters in one of the implemented models takes about 40 more minutes, so, if enabled, a complete first run of the notebook takes more than 2 hours.
That's why a boolean flag was created to control the grid search. It is set to True by default; if set to False, the best model found in previous runs is assigned manually.
If you run the notebook a second time, or if you previously created the datasets (by running the script "make_datasets.py", for example), the running time will be reduced.
You can toggle the grid search on/off by changing the "grid_search" variable in the code accessible through the link below:
Toggle grid search (the "Top" link will take you back here)
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
from sklearn.cluster import KMeans
# read in the json files
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
print(portfolio.shape)
portfolio.sort_values(by='offer_type')
Let's substitute the "channels" column with one-hot-encoded columns.
portfolio = pp.channels_ohe(portfolio)
portfolio
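For reference, a hedged sketch of what such a one-hot encoding of a list-valued "channels" column might look like in plain pandas (pp.channels_ohe may differ in details such as column names):

```python
import pandas as pd

# Toy portfolio-like frame with a list-valued 'channels' column.
df = pd.DataFrame({'id': ['o1', 'o2'],
                   'channels': [['email', 'web'],
                                ['email', 'mobile', 'social']]})

# Explode the lists to one row per channel, dummy-encode,
# then collapse back to one indicator row per offer.
ohe = pd.get_dummies(df.channels.explode()).groupby(level=0).max()
ohe.columns = ['channel_' + c for c in ohe.columns]
df = df.drop('channels', axis=1).join(ohe)
print(df)
```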
print(profile.shape)
profile.head()
profile.info()
profile.isnull().sum()
An age of 118 looks strange.
profile.age.value_counts()[:10]
It's almost certain that 118 was the placeholder value for NaNs in the age. It also seems likely that the customers that have any missing data have all of their 'profile' data missing. Let's test that.
profile.age = profile.age.replace(118, np.nan)
profile.isnull().sum(axis=1).unique()
That means that customers have either 3 missing values or none, as assumed earlier.
Let's separate the customers into those who have missing data and those who don't. If a customer's entire profile is missing and they have no entry in the transcript dataframe, they should be ignored (as we have no useful information about that client). Are there any of those?
profile['missing_demographics'] = profile.isnull().any(axis=1).astype(int)
profile.head()
utils.common_values(profile.id, transcript.person)
All the customers in the "profile" dataframe also appear in the "transcript" dataframe. That means every customer provides at least some information, and none can be dropped.
profile.became_member_on = pd.to_datetime(profile.became_member_on,
format='%Y%m%d')
profile.head()
gender_dict = {'F': 0, 'M': 1, 'O': 2, None: np.nan}
gender_dict_inverse = {0: 'F', 1: 'M', 2: 'O', np.nan: None}
profile.gender = profile.gender.replace(gender_dict)
profile.head()
# Create a feature that shows when a customer became member,
# as the number of days since January 1st, 1970, to the signup date.
profile['member_since_epoch'] = (
profile.became_member_on - dt.datetime(1970,1,1)).dt.days
profile.head()
profile.income.hist(bins=30)
profile.gender.value_counts(dropna=False)
profile.age.hist(bins=30)
profile.became_member_on.hist(bins=30)
sns.pairplot(profile[['age',
'income',
'gender',
'member_since_epoch']].dropna())
There is a very strange age-income "stair" pattern... I assume that is due to the inner workings of the simulator.
Finding: The simulator seems to cap the income for younger customers, in discrete steps.
print(transcript.shape)
transcript.head()
transcript.info()
transcript.time.unique()
transcript.event.unique()
Let's get the "value" data out of the dictionaries, and then check each type of event separately.
transcript = pp.unwrap_transcript(transcript)
transcript[transcript.event == 'offer received'].head()
transcript[transcript.event == 'offer viewed'].head()
transcript[transcript.event == 'transaction'].head()
transcript[transcript.event == 'offer completed'].head()
The "event" column could be Label-encoded but, other than that, there don't seem to be many simple preprocessing actions to take. There is still a lot of data wrangling before having a well posed problem, though.
Let's make a dataset that doesn't take into account the particular person or offer, but rather their features.
static_df = pp.join_data(transcript, profile, portfolio)
static_df.head()
static_df.duration.unique()
merged_df = pp.join_data(transcript, profile, portfolio, static=False)
merged_df.head()
def get_differences(user_events):
return pd.DataFrame((user_events.time - user_events.time.shift(1)).values)
# "delays" shows the time differences between consecutive offers for each
# user.
sent = merged_df[merged_df.event == 'offer received']
delays = sent.groupby('person').apply(get_differences).rename(
columns={0: 'diff'})
delays = delays.unstack()
delays.head()
diffs = delays.values.flatten()
diffs = diffs[~np.isnan(diffs)]
diffs = pd.Series(diffs)
diffs.hist(bins=50)
plt.title('Time difference between two consecutive offers, for each user')
diffs.value_counts()
d_vals = diffs.sort_values().unique()
d_vals
d_vals[1:] - d_vals[:-1]
# Times contains the time at which each offer was sent
times = sent.groupby('person').apply(lambda x: pd.DataFrame(x.time.values)).rename(
columns={0: 'times'}).unstack()
times.head()
send_times = times.values.flatten()
send_times = pd.Series(send_times[~np.isnan(send_times)])
send_times.hist(bins=50)
send_times.value_counts()
Let's check if they are all multiples of 24.
time_values = send_times.sort_values().unique()
time_values / 24
Indeed, all the sending times are multiples of 24 (one day).
transactions = merged_df[merged_df.event == 'transaction']
transactions.time.hist(bins=50)
transaction_t_values = transactions.groupby('time').time.count()
transaction_t_values.head()
Transaction times go in steps of size 6, which is the minimum "tick" of the simulator (or at least of the dataset provided). That means that transactions occur at any time of the day.
Let's look at the effects of the offers in transactions.
transaction_t_values.plot()
for time_value in time_values:
plt.vlines(time_value,
transaction_t_values.min(),
transaction_t_values.max(), 'r', label='offers sending')
plt.title('Amount of transactions and offers sending times')
The offers seem to have a positive effect overall. Also, that effect seems to last more than 168 hours (7 days) after the start of the offer, and some offers have a duration of 10 days.
That means the effects of one offer and the next could overlap. That should be checked.
Initially, at least, the effect of an offer will be assessed on one particular client, although it is clear that some offers also affect the behavior of other clients. In particular, BOGO offers seem well suited to acquiring new clients, for example when an existing customer treats another person to a ("free") coffee. That won't be considered in this project, because it would be very difficult with the available data, but it could be an interesting thing to research in the real world.
Let's study one case (if there is overlap in that case, we will know that the simulator sends offers that may overlap; if not, we may look at other customers or devise a general procedure to detect overlap in the full dataset).
person = merged_df[merged_df.person == merged_df.person[0]]
offers = person[person.event.isin(['offer received', 'offer completed'])]
offers = offers[['event', 'time', 'duration', 'offer_id']]
reception = offers[offers.event == 'offer received'].copy()
reception['expected_finish'] = reception.time + 24 * reception.duration
reception
# This plots the times while each offer is "active".
for idx, row in reception.iterrows():
x = np.arange(merged_df.time.max()).astype(int) # Total time
x_on = np.arange(row.time, row.expected_finish).astype(int) # Time when the offer is "active"
y = np.zeros(merged_df.time.max())
y[x_on] = 1
plt.plot(x, y)
spending = np.zeros(merged_df.time.max())
spending[person.time] = person.amount
plt.plot(x, spending / spending.sum())
There is clear overlap in that case. That suggests that overlapping offer effects are not a rare phenomenon, which will obviously introduce some distortions.
Based on what was seen in the EDA, some functions were implemented to preprocess the raw data. The initial functions preprocess each dataframe individually, and then other functions create a "static" dataset. It is "static" in the sense that it is built to be useful under the assumption that the customers' behavior is stationary, so that an estimator can ignore the "absolute" time feature.
The preprocessing will be shown below, using the implemented functions. After that, the full process can be run by calling just two functions:
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
# read in the json files
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
portfolio = pp.process_portfolio(portfolio)
portfolio
profile = pp.process_profile(profile)
print(profile.shape)
profile.head()
transcript = pp.process_transcript(transcript)
print(transcript.shape)
transcript.head()
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
data, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)
print(data.shape)
data.head()
The process will be shown for a single customer, and then the full dataset is calculated.
person = data[data.person == data.person.iloc[0]]
received, viewed, completed, transactions = pp.split_transcript(person)
The next thing to do is to add a column that tells whether an offer was completed or not. That is information "from the future", so it will typically be used as a "target", not as a "feature". The "fill_completion" function takes care of finding whether an offer will be completed or not. Its docstring can be seen below.
print('fill_completion\n' + '-'*100)
print(pp.fill_completion.__doc__)
data = pp.fill_completion(received, completed)
Now, let's fill a column that records whether an offer was viewed during its "active" period (called "viewed"), and a "success" column that marks whether the offer was completed after being viewed, within the "active" period.
The docstring for the function can be seen below.
Note: All the "informational" offers will be marked with "success = 0", because they are never "completed".
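As a hedged sketch of that rule (not the actual fill_viewed implementation, which works on whole dataframes), the success condition for a single offer can be expressed as:

```python
def offer_success(received_t, expires_t, view_t, complete_t):
    """Sketch of the success rule: viewed inside the active window,
    then completed (also inside the window) no earlier than the view.
    Times are in hours; None means the event never happened."""
    if view_t is None or not (received_t <= view_t <= expires_t):
        return 0  # never viewed while active
    if complete_t is None or not (view_t <= complete_t <= expires_t):
        return 0  # not completed after viewing, within the window
    return 1

print(offer_success(0, 240, 24, 120))    # viewed then completed -> 1
print(offer_success(0, 240, None, 120))  # completed but never viewed -> 0
print(offer_success(0, 240, 24, 300))    # completed after expiration -> 0
```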
print('fill_viewed\n' + '-'*100)
print(pp.fill_viewed.__doc__)
data = pp.fill_viewed(data, viewed)
The next step was to record each customer's spending during the offer's duration (from reception until expiration), and the same amount from reception until completion or expiration.
The corresponding "profit" columns were also created. Calculating realistic profits would require the cost of production. There is also the clear fact that some offers are not intended to be profitable during their duration or until completion, but over a longer period: with BOGO offers, the objective is clearly not that the customer "completes" the offer, but rather the posterior or secondary effects of the completion. In any case, given the lack of further information, "profit" was calculated as the customer's spending minus the reward paid if the offer was completed.
The "actual reward" column shows the paid reward (zero if the offer was not completed).
print('fill_profits\n' + '-'*100)
print(pp.fill_profits.__doc__)
data = pp.fill_profits(data, transactions)
data = data.drop(['event', 'reward', 'amount'], axis=1)
data
And that is how the "static" dataset looks for a single customer. The entire process, for all customers, is summarized in the two functions below. It may take several minutes to run.
# Read the data
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
# Basic Preprocessing
print('Basic preprocessing')
%time data, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)
# Create the static dataset (only if it doesn't exist already: it takes long)
print('Creating the static dataset (this may take several minutes)')
if os.path.exists(os.path.join(DATA_INTERIM, 'static_data.pkl')):
static_data = pd.read_pickle(os.path.join(DATA_INTERIM, 'static_data.pkl'))
else:
%time static_data = pp.generate_static_dataset(data)
# Save the static dataset
print('Saving')
static_dataset_path = os.path.join(DATA_INTERIM, 'static_data.pkl')
static_data.to_pickle(static_dataset_path)
In this part, the functions necessary to solve the "offer success" prediction problem will be created: a function to create the train-test datasets, an encoder class, a function to show the feature importances of XGBoost models, and some validation and test functions. Several validation schemes were considered: time split, random K-fold, random train-val-test split, and random train-val-test split by customer.
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
The "static" dataset is adapted to be used in the "offer success" problem. That means removing all the information "from the future", from the inputs (the "success" feature will be the target, with some modifications). Also, the customer id and offer id may be removed, and a time-based split of the data perfomed. The function that takes care of that is called "get_success_data", and the docstring can be seen below.
print(sd.get_success_data.__doc__)
X_train, X_test, y_train, y_test, encoder = sd.get_success_data()
X_train.head()
y_train.head()
The following methods were considered for validation:
The time-split method was finally selected as the main validation method. The random train-val split by customer was chosen as a sanity check (to test whether its results are similar to those of the time-split).
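A minimal sketch of the time-split idea (the project's evos.time_split_validation may differ in details): train on the earliest records and validate on the latest ones, so the model is never evaluated on data from before its training period.

```python
import numpy as np
import pandas as pd

def time_split(df, time_col='time', val_fraction=0.2):
    # Cut at the (1 - val_fraction) quantile of the time column.
    cutoff = df[time_col].quantile(1 - val_fraction)
    train = df[df[time_col] <= cutoff]
    val = df[df[time_col] > cutoff]
    return train, val

# Toy frame with a synthetic 'time' column.
df = pd.DataFrame({'time': np.arange(100), 'y': 0})
train, val = time_split(df)
print(len(train), len(val))  # 80 20
```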
# Create a basic model
model = Pipeline([
('encoder', encoder),
('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1,
feature_names=X_train.columns))
])
# Evaluate it with a time-split validation
evos.time_split_validation(model)
# Create a basic model
model = Pipeline([
('encoder', encoder),
('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1,
feature_names=X_train.columns))
])
# Evaluate it with a random k-fold validation
evos.random_kfold_validation(model)
# Create a basic model
model = Pipeline([
('encoder', encoder),
('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1,
feature_names=X_train.columns))
])
# Evaluate it with a random train-val split validation
evos.random_1fold_validation(model)
# Create a basic model
model = Pipeline([
('encoder', encoder),
('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1,
feature_names=X_train.columns))
])
# Evaluate it with a random train-val split validation by customer
evos.random_1fold_cust_validation(model)
# Create a basic model
model = Pipeline([
('encoder', encoder),
('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1,
feature_names=X_train.columns))
])
# Evaluate the model in the test set (time-splitted)
evos.offer_success_test(model)
Two strategies were considered to deal with the missing data:
A function to show some of the results of the imputers was also implemented.
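As a rough sketch of the first, "basic" strategy (md.BasicImputer may differ in details): fill numeric columns with the median and categorical columns with the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy profile-like frame with missing demographics.
df = pd.DataFrame({'age': [25, 40, np.nan, 60],
                   'gender': ['F', 'M', 'M', None]})

# Median for the numeric column, mode for the categorical one.
df['age'] = df.age.fillna(df.age.median())
df['gender'] = df.gender.fillna(df.gender.mode()[0])
print(df)
```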
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import accuracy_score
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
print(md.BasicImputer.__doc__)
data = pd.read_pickle(os.path.join(DATA_INTERIM, 'static_data.pkl'))
print(data.shape)
data.head()
data.isnull().mean().sort_values(ascending=False)
data[data.viewed == 1].view_time.isnull().sum()
The "view_time" column has missing values only in the places where the customer didn't see the offer, which is reasonable. There is nothing to fill there. The values to fill are those from the demographics of the customer: age, gender, income.
# Create a pipeline (it is necessary to encode before filling the missing data)
encoder_imputer = Pipeline([
('encoder', pp.BasicEncoder()),
('imputer', md.BasicImputer())
])
# Encode and fill with medians and most frequent values.
%time filled = encoder_imputer.fit_transform(data)
print(filled.isnull().mean().sort_values(ascending=False))
filled.head()
vis.show_imputer_results(data, filled)
It can be seen that filling with medians and most frequent values clearly modifies the value distributions.
All the (relevant) missing data is in the profile dataframe. The problem is that when a customer has one feature missing, they have almost all of them missing. That makes it difficult to estimate the missing data but, in the worst case, the imputed values will at least follow the distribution of the rest of the data.
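A hedged sketch of the estimator-based idea (md.EstimatorImputer may work differently): fit a model on the rows where a feature is present, then predict it where it is missing. Here with synthetic data and a RandomForest:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic profile-like data: income loosely depends on membership age.
rng = np.random.RandomState(0)
df = pd.DataFrame({'member_days': rng.randint(0, 2000, 200).astype(float)})
df['income'] = 30000 + 20 * df.member_days + rng.normal(0, 500, 200)
df.loc[df.sample(40, random_state=0).index, 'income'] = np.nan  # knock out 40

# Fit on rows with income present, predict where it is missing.
known = df[df.income.notnull()]
missing = df[df.income.isnull()]
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[['member_days']], known.income)
df.loc[missing.index, 'income'] = model.predict(missing[['member_days']])
print(df.income.isnull().sum())  # 0
```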
print(md.EstimatorImputer.__doc__)
# Show the initial status
data = pd.read_pickle(os.path.join(DATA_INTERIM, 'static_data.pkl'))
data.isnull().mean().sort_values(ascending=False)
# Create a pipeline (it is necessary to encode before filling the missing data)
encoder_imputer = Pipeline([
('encoder', pp.BasicEncoder()),
('imputer', md.EstimatorImputer())
])
# Encode and fill the missing values with estimator predictions.
%time filled = encoder_imputer.fit_transform(data)
print(filled.isnull().mean().sort_values(ascending=False))
filled.head()
vis.show_imputer_results(data, filled)
That looks better than the BasicImputer, although it takes longer to run. A metric like RMSE or accuracy can show whether there is an actual improvement.
estimator_imputer = Pipeline([
('encoder', pp.BasicEncoder()),
('imputer', md.EstimatorImputer())
])
basic_imputer = Pipeline([
('encoder', pp.BasicEncoder()),
('imputer', md.BasicImputer())
])
# Get a dataset without missing demographics
clean_data = data[data.missing_demographics == 0]
# Separate targets
target_feats = ['age', 'income', 'gender']
X = clean_data.drop(target_feats, axis=1)
y = clean_data[target_feats]
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=2018)
# Join again, but only the training set. The test set will contain NaNs in the targets
data_train = X_train.join(y_train)
X_input = pd.concat([data_train, X_test], sort=False).sort_values(by='time')
# Show the results
print(X_input.isnull().mean().sort_values(ascending=False))
X_input.head()
%time X_out_basic = basic_imputer.fit_transform(X_input)
%time X_out_estimator = estimator_imputer.fit_transform(X_input)
# Name the true values of the missing data
age_true = y_test.age
income_true = y_test.income
gender_true = pp.gender_encode(y_test.loc[:, ['gender']])
# Evaluate the Basic Imputer
print('Basic Imputer\n')
age_basic_pred = X_out_basic.loc[y_test.index, :].age
income_basic_pred = X_out_basic.loc[y_test.index, :].income
gender_basic_pred = X_out_basic.loc[y_test.index, :].gender
print('Age RMSE: {}'.format(np.sqrt(mse(age_true, age_basic_pred))))
print('Income RMSE: {}'.format(np.sqrt(mse(income_true, income_basic_pred))))
print('Gender Accuracy: {}'.format(accuracy_score(gender_true,
                                                  gender_basic_pred)))
print('-'*100)
# Evaluate the Estimator Imputer
print('Estimator Imputer\n')
age_est_pred = X_out_estimator.loc[y_test.index, :].age
income_est_pred = X_out_estimator.loc[y_test.index, :].income
gender_est_pred = X_out_estimator.loc[y_test.index, :].gender
print('Age RMSE: {}'.format(np.sqrt(mse(age_true, age_est_pred))))
print('Income RMSE: {}'.format(np.sqrt(mse(income_true, income_est_pred))))
print('Gender Accuracy: {}'.format(accuracy_score(gender_true,
                                                  gender_est_pred)))
print('-'*100)
Several clustering techniques were used to try to better understand the dataset, and to generate new features for the estimators. The clustering algorithms used were:
The first approach was to use all the profile data available (4 features) to perform the clustering.
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, ward, fcluster
from time import time
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN
from mpl_toolkits.mplot3d import Axes3D
from sklearn.neighbors import KNeighborsClassifier
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
STATIC_DATASET_PATH = os.path.join(DATA_INTERIM, 'static_data.pkl')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
import src.features.clustering as clust
# Get the data
X_train, X_test, y_train, y_test, encoder = sd.get_success_data(drop_time=False,
anon=False)
X_train.head()
The interesting clustering is on the customers' features, as there are very few distinct offers.
# Encode and filter relevant features
customer_feats = ['age', 'gender', 'income', 'missing_demographics',
'member_epoch_days']
offer_feats = ['difficulty', 'duration', 'offer_type', 'reward_t',
'channel_web', 'channel_social', 'channel_mobile',
'channel_email']
X_train_t = encoder.fit_transform(X_train)
X_train_t = X_train_t[customer_feats]
X_test_t = encoder.transform(X_test)
X_test_t = X_test_t[customer_feats]
# Drop duplicates and missing data
X_train_t = X_train_t.dropna().drop_duplicates()
X_test_t = X_test_t.dropna().drop_duplicates()
# Keep a copy with the original demographics
X_train_o = pp.gender_decode(X_train_t.copy())
X_test_o = pp.gender_decode(X_test_t.copy())
# Drop the irrelevant column
X_train_t = X_train_t.drop('missing_demographics', axis=1)
X_test_t = X_test_t.drop('missing_demographics', axis=1)
# Normalize
scaler = StandardScaler()
scaler.fit(X_train_t)
X_train_t = pd.DataFrame(scaler.transform(X_train_t),
index=X_train_t.index,
columns=X_train_t.columns)
X_test_t = pd.DataFrame(scaler.transform(X_test_t),
index=X_test_t.index,
columns=X_test_t.columns)
print(X_train_t.shape)
X_train_t.head()
print(X_test_t.shape)
X_test_t.head()
sns.pairplot(X_train_t)
There are some "artificial" stair-like shapes in the income-age plot, and also in the income-member_epoch_days plot. The thresholds could define clusters...
vis.pca_visualize(X_train_t)
The 2-D representation looks suitable for GMM or DBSCAN. In the next sections, some clustering algorithms will be tested.
For model comparison and validation, I will use the Silhouette score and DBCV (Density-Based Clustering Validation). As the main objective of this section is to obtain new features, I will keep the cluster indexes of the methods that yield the best score on each metric. I could also keep the best result of each algorithm.
DBCV was implemented here: https://github.com/christopherjenness/DBCV
Let's try that library.
# import DBCV
from scipy.spatial.distance import euclidean
kmeans = KMeans(n_clusters=4)
kmeans.fit(X_train_t)
cluster_labels = kmeans.predict(X_train_t)
# %time DBCV.DBCV(X_train_t.values, cluster_labels, euclidean)
Running that function for validation takes too long, so I will only use the Silhouette score (plus AIC in the case of GMM), along with the elbow method (when possible) and some PCA visualizations.
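The elbow method mentioned above looks for the k where the KMeans inertia (presumably the quantity that clust.kmeans_error tracks; that's an assumption on my part) stops dropping sharply. A self-contained sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated synthetic blobs (explicit centers for reproducibility)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6], [6, -6]],
                  cluster_std=0.8, random_state=0)

# Inertia = sum of squared distances to the closest centroid
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}
print(inertias)  # the drop flattens out after k = 4, the "elbow"
```

Inertia always decreases as k grows, so the elbow (not the minimum) is what matters.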
# A quick visualization of K-means clustering with 4 clusters
kmeans = KMeans(n_clusters=4)
%time kmeans.fit(X_train_t)
cluster = kmeans.predict(X_train_t)
vis.pca_visualize_clusters(X_train_t, cluster)
The visualization is not too informative. Let's use the Silhouette score to determine the best number of clusters.
clusters = [{'n_clusters': n} for n in range(2, 30)]
silhouette, error, best_n = clust.validate_clustering(X_train_t, KMeans, clusters, clust.kmeans_error,
'n_clusters')
OK, let's use KMeans with n = 8
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, random_state=2018)
kmeans.fit(X_train_t)
cluster_labels = kmeans.predict(X_train_t)
# 2D PCA visualization
vis.pca_visualize_clusters(X_train_t, cluster_labels)
# Heatmap of the cluster centers' features
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
# The same as the heatmap, but with a barplot
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
# Number of customers per cluster
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Description of the clusters:
7) Early-adopter men. Slightly lower income and age than the mean.
Reorganized (this shows that for every "male" cluster there is a very similar "female" cluster, with minor differences):
In every cluster, the female customers seem to have a slightly higher income than the male customers. The other gender options don't seem to be numerous enough to influence the clusters in a noticeable way.
Let's look at a dendrogram for "Ward" hierarchical clustering.
p = 50
%time linkage_matrix = ward(X_train_t)
%time ddg = dendrogram(linkage_matrix, truncate_mode='lastp', p=p)
plt.hlines(40, 0, 10 * p, colors='y')
With that limit, maybe 12 clusters would be reasonable. Let's check that with the Silhouette score.
%time sns.clustermap(X_train_t, figsize=(10, 20), method='ward', cmap='viridis')
The most noticeable division seems to be by gender. The other features also seem to cluster the data. For example, "age" seems to divide the "male" cluster pretty well.
Let's get the best 'cutting distance' between clusters, according to the silhouette score.
from collections import defaultdict
# cut_dists = np.arange(10, 150, 10)
cut_dists = np.exp(np.linspace(0, 5.5, 50))
results = defaultdict(dict)
# Cluster and gather results data
for i, max_d in enumerate(cut_dists):
    tic = time()
    clusters = fcluster(linkage_matrix, max_d, criterion='distance')
    n_clusters = len(np.unique(clusters))
    results['n_clusters'][max_d] = n_clusters
    if n_clusters >= 2:
        results['score'][max_d] = silhouette_score(X_train_t, clusters)
    else:
        results['score'][max_d] = 0
    toc = time()
    print('Algorithm {} of {} finished in {} seconds.'.format(
        i + 1, len(cut_dists), (toc - tic)))
# Show the results
best_d, best_score = max(results['score'].items(), key=lambda x: x[1])
best_n_clust = results['n_clusters'][best_d]
print('The best Silhouette score is for {} clusters (maximum distance = {}), '
      'and its value is: {}'.format(best_n_clust, best_d, best_score))
# Plot silhouette score vs distance
m_dists = list(results['score'].keys())
scores = list(results['score'].values())
plt.plot(m_dists, scores)
plt.title('Silhouette score')
plt.vlines(best_d, min(scores), max(scores), 'r')
plt.xlabel('Maximum distance')
# Plot silhouette score vs n_clusters
plt.figure()
n, score_n = zip(*((results['n_clusters'][d], results['score'][d])
for d in results['n_clusters'].keys()))
plt.plot(n, score_n)
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n), max(score_n), 'r')
plt.xlabel('N clusters')
# Plot n_clusters vs distance
plt.figure()
m_dists = list(results['n_clusters'].keys())
n = list(results['n_clusters'].values())
plt.plot(m_dists, n)
plt.title('N clusters')
plt.xlabel('Max distance')
plt.ylabel('N')
plt.vlines(best_d, 0, max(n), 'r')
Zooming in on the area of interest...
# Plot silhouette score vs n_clusters
n_max = 20
n, score_n = zip(*((results['n_clusters'][d], results['score'][d])
for d in results['n_clusters'].keys()))
n = np.array(n)
score_n = np.array(score_n)
plt.plot(n[n<n_max], score_n[n<n_max])
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n[n<n_max]), max(score_n[n<n_max]), 'r')
plt.xlabel('N clusters')
It can be seen that beyond about 12 clusters the Silhouette score decreases noticeably, which agrees with the initial intuition from the dendrogram. As 12 clusters are likely more informative than 2 (the latter probably just splitting by gender), 12 may be the chosen number of clusters. Let's visualize both cases.
First, let's create a dataframe with the results
res_df = pd.DataFrame(results)
res_df.index.name = 'distance'
res_df = res_df.reset_index()
res_df.head()
max_d = best_d
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
That is just dividing by gender. Let's try with 12 clusters.
dist_12 = res_df[res_df.n_clusters == 12].distance.values[0]
print(dist_12)
max_d = dist_12
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
What about 9?
dist_9 = res_df[res_df.n_clusters == 9].distance.values[0]
print(dist_9)
max_d = dist_9
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
The 9 clusters case looks very similar to that found by K-Means, with the difference that Ward seems to have found a new cluster for the people that chose "Other" in gender. We should check if that is the case. I will keep both cases (9 and 12) to generate new features.
For Gaussian Mixture Models, besides the Silhouette score, I will also record the Akaike Information Criterion (AIC). The function that computes it is gmm_aic (docstring below).
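scikit-learn's GaussianMixture exposes aic() (and bic()) directly, which is presumably what gmm_aic wraps. A sketch of scanning the number of components on synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three well-separated Gaussian blobs (explicit centers for reproducibility)
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.7, random_state=0)

# Lower AIC is better: it trades goodness of fit against model complexity
aics = {}
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, random_state=0).fit(X)
    aics[n] = gmm.aic(X)
print(min(aics, key=aics.get), aics)
```

Unlike inertia, AIC penalizes extra components, so its minimum can be used directly without an elbow heuristic.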
print(clust.gmm_aic.__doc__)
clusters = [{'n_components': n} for n in range(2, 30)]
silhouette, aic, best_n = clust.validate_clustering(X_train_t, GaussianMixture, clusters,
clust.gmm_aic, 'n_components')
print('The best case, based on the Silhouette score, is {}'.format(best_n))
pd.DataFrame([clusters, aic, silhouette], index=['params', 'aic', 'silhouette']).T
We should check if the case with 2 clusters is just looking for gender. I would also like to look at the 4-clusters case.
gmm = GaussianMixture(n_components=2)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
How would this look if we just divided by gender?
cluster_labels = (X_train.gender == 'F').astype(int)
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
As expected, the 2-clusters case is probably just dividing the dataset by gender. Not an interesting division: it won't give any new information.
gmm = GaussianMixture(n_components=4)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
The 4-clusters division takes into account the people who answered "Other" to the gender question; it's a very small group in the complete dataset. It doesn't look like Gaussian Mixture Models will give much new information.
Let's grid search for DBSCAN parameters.
# Define DBSCAN parameters
n_batch = 20
m_batch = 5
eps_list = np.linspace(0.1, 2.0, n_batch).repeat(m_batch)
min_samples_list = [5, 20, 50, 100, 300] * n_batch
params = [{'eps': e, 'min_samples': ms} for e, ms in zip(eps_list, min_samples_list)]
# Search and gather the results
silhouette = list()
n_clusters = list()
for i, param_set in enumerate(params):
    tic = time()
    method = DBSCAN(**param_set)
    method.fit(X_train_t)
    labels = method.labels_
    try:
        silhouette.append(silhouette_score(X_train_t, labels))
    except ValueError:
        silhouette.append(0)
    n_clusters.append(len(np.unique(labels)))
    toc = time()
    print('Algorithm {} of {} finished in {} seconds. Params: {} Clusters: {}'.format(
        i + 1, len(params), (toc - tic), param_set, len(np.unique(labels))))
# Show the results
best_silhouette_params = params[np.argmax(silhouette)]
print('The best Silhouette score is for {}, and its value is: {}'.format(
best_silhouette_params, max(silhouette)))
print('The number of clusters for {} is: {}'.format(
best_silhouette_params, n_clusters[np.argmax(silhouette)]))
# 3D plot for the Silhouette score
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in matplotlib 3.6
ax.plot_trisurf(eps_list, min_samples_list, silhouette, linewidth=0.2, antialiased=True)
# 3D plot for the number of clusters
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_trisurf(eps_list, min_samples_list, n_clusters, linewidth=0.2, antialiased=True)
results = pd.DataFrame([eps_list, min_samples_list, silhouette, n_clusters],
index=['epsilon', 'min_samples', 'silhouette_score', 'n_clusters']).T
results.sort_values(by='silhouette_score', ascending=False)
Let's try with the best values then.
# Create clusters
eps = 0.9
min_samples = 50
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
That seems to have divided the dataset by gender (including the option "Other"). Let's try other cases, with more clusters.
results.sort_values(by=['n_clusters', 'silhouette_score'], ascending=False).head(15)
# Create clusters
eps = 0.9
min_samples = 5
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# Create clusters
eps = 0.3
min_samples = 20
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
That looks like an interesting clustering!
# Create clusters
eps = 0.2
min_samples = 20
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
I will keep the case with eps = 0.3, min_samples = 20: a minimum of 20 samples is a sensible threshold for a cluster to be "statistically relevant", eps = 0.3 is low enough, 10 clusters is a manageable number, and this configuration detected a third relevant cluster based on membership time rather than gender.
Let's first combine all the clustering labels as features, to look at them together, and then save the resulting datasets. Some clustering algorithms can predict labels for the test set without retraining; others (such as hierarchical clustering) cannot. The test data must not be used to define the clusters so, where retraining would otherwise be needed, a K-Nearest Neighbors classifier will assign each test sample to a cluster. Conceptually, after labeling the training samples, we define regions based on the relative density of each label and treat those as fixed "cluster boundaries"; a test sample is then assigned to the cluster whose region it falls into.
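That scheme can be sketched as follows, on synthetic stand-in data (the project's create_cluster_feats function presumably encapsulates this logic):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 3))   # stand-in for the scaled training features
X_te = rng.normal(size=(50, 3))    # stand-in for the scaled test features

# DBSCAN (like Ward) has no predict(); label the training set only
train_labels = DBSCAN(eps=0.8, min_samples=5).fit(X_tr).labels_

# Freeze the cluster boundaries with a KNN classifier and
# assign each test sample to the cluster of its local neighborhood
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, train_labels)
test_labels = knn.predict(X_te)
```

Note that DBSCAN's noise label (-1) is propagated as just another class, which matches the "fixed boundaries" idea above.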
To create the features and save the data, the function create_cluster_feats_4d will be used; its docstring can be read below.
print(clust.create_cluster_feats_4d.__doc__)
%time static_cluster1_dataset, X_train_r, X_test_r, y_train, y_test =\
clust.create_cluster_feats_4d(save=True)
X_train_r.head()
sns.pairplot(X_train_r[['kmeans_8', 'ward_12', 'dbscan_10']])
As seen in the previous part, many clustering algorithms tend to split mostly by gender, which is reasonable given that "gender" is a discrete feature with only three possible values. I therefore decided to try clustering the customers without the gender feature. An added advantage is that the samples can then be visualized directly in a 3D plot, with no need for PCA. The original 3D plots were created with Plotly but, since notebooks with Plotly 3D plots take a while to load, only images of the original plots are included here.
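For readers without Plotly, a rough matplotlib equivalent of those 3D scatter plots could look like this (stand-in data; the real plots use the three scaled features age, income and member_epoch_days):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; in a notebook, %matplotlib inline suffices
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))          # stand-in for the three scaled features
labels = rng.integers(0, 3, size=300)  # stand-in for the cluster labels

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
sc = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, cmap='viridis', s=10)
ax.set_xlabel('age'); ax.set_ylabel('income'); ax.set_zlabel('member_epoch_days')
fig.colorbar(sc, ax=ax, label='cluster')
fig.savefig('clusters_3d.png')
```

Unlike Plotly, this produces a static image, which is exactly what the embedded screenshots below show.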
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from time import time
from scipy.cluster.hierarchy import dendrogram, ward, fcluster
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster1.pkl')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
import src.features.clustering as clust
# Get the data
X_train, X_test, y_train, y_test, encoder = sd.get_success_data(
basic_dataset_path=STATIC_DATASET_PATH,
drop_time=False,
anon=False)
# Encode and filter relevant features
customer_feats = ['age', 'income', 'missing_demographics',
'member_epoch_days']
offer_feats = ['difficulty', 'duration', 'offer_type', 'reward_t',
'channel_web', 'channel_social', 'channel_mobile',
'channel_email']
X_train_t = encoder.fit_transform(X_train)
X_train_t = X_train_t[customer_feats]
X_test_t = encoder.transform(X_test)
X_test_t = X_test_t[customer_feats]
# Drop duplicates and missing data
X_train_t = X_train_t.dropna().drop_duplicates()
X_test_t = X_test_t.dropna().drop_duplicates()
# Keep a copy with the original demographics
X_train_o = X_train_t.copy()
X_test_o = X_test_t.copy()
# Drop the irrelevant column
X_train_t = X_train_t.drop('missing_demographics', axis=1)
X_test_t = X_test_t.drop('missing_demographics', axis=1)
# Normalize
scaler = StandardScaler()
scaler.fit(X_train_t)
X_train_t = pd.DataFrame(scaler.transform(X_train_t),
index=X_train_t.index,
columns=X_train_t.columns)
X_test_t = pd.DataFrame(scaler.transform(X_test_t),
index=X_test_t.index,
columns=X_test_t.columns)
# Show X_train_t
X_train_t.head()

A function was created for this purpose (it uses Plotly, which is why only the resulting images are copied here). Below is the result for a K-means run with 4 clusters.

clusters = [{'n_clusters': n} for n in range(2, 30)]
silhouette, error, best_n = clust.validate_clustering(X_train_t, KMeans, clusters,
clust.kmeans_error, 'n_clusters')
OK, let's use KMeans with n = 3
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=2018)
kmeans.fit(X_train_t)
cluster_labels = kmeans.predict(X_train_t)
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_train_t, cluster_labels)

Description of the clusters:
%time linkage_matrix = ward(X_train_t)
p = 50
%time ddg = dendrogram(linkage_matrix, truncate_mode='lastp', p=p)
plt.hlines(40, 0, 10 * p, colors='y')
With that limit, maybe 9 clusters would be reasonable. Let's check that with the Silhouette score. But first, another visualization of the features and clusters.
%time sns.clustermap(X_train_t, figsize=(10, 20), method='ward', cmap='viridis')
# cut_dists = np.arange(10, 150, 10)
cut_dists = np.exp(np.linspace(0, 5.5, 50))
results = defaultdict(dict)
# Cluster and gather results data
for i, max_d in enumerate(cut_dists):
    tic = time()
    clusters = fcluster(linkage_matrix, max_d, criterion='distance')
    n_clusters = len(np.unique(clusters))
    results['n_clusters'][max_d] = n_clusters
    if n_clusters >= 2:
        results['score'][max_d] = silhouette_score(X_train_t, clusters)
    else:
        results['score'][max_d] = 0
    toc = time()
    print('Algorithm {} of {} finished in {} seconds.'.format(
        i + 1, len(cut_dists), (toc - tic)))
# Show the results
best_d, best_score = max(results['score'].items(), key=lambda x: x[1])
best_n_clust = results['n_clusters'][best_d]
print('The best Silhouette score is for {} clusters (maximum distance = {}), '
      'and its value is: {}'.format(best_n_clust, best_d, best_score))
# Plot silhouette score vs distance
m_dists = list(results['score'].keys())
scores = list(results['score'].values())
plt.plot(m_dists, scores)
plt.title('Silhouette score')
plt.vlines(best_d, min(scores), max(scores), 'r')
plt.xlabel('Maximum distance')
# Plot silhouette score vs n_clusters
plt.figure()
n, score_n = zip(*((results['n_clusters'][d], results['score'][d])
for d in results['n_clusters'].keys()))
plt.plot(n, score_n)
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n), max(score_n), 'r')
plt.xlabel('N clusters')
# Plot n_clusters vs distance
plt.figure()
m_dists = list(results['n_clusters'].keys())
n = list(results['n_clusters'].values())
plt.plot(m_dists, n)
plt.title('N clusters')
plt.xlabel('Max distance')
plt.ylabel('N')
plt.vlines(best_d, 0, max(n), 'r')
Zooming in on the area of interest...
# Plot silhouette score vs n_clusters
n_max = 20
n, score_n = zip(*((results['n_clusters'][d], results['score'][d])
for d in results['n_clusters'].keys()))
n = np.array(n)
score_n = np.array(score_n)
plt.plot(n[n<n_max], score_n[n<n_max])
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n[n<n_max]), max(score_n[n<n_max]), 'r')
plt.xlabel('N clusters')
Let's visualize the results
max_d = best_d
print(max_d)
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

It's similar to, but not the same as, the K-means clustering. Let's try with more clusters.
res_df = pd.DataFrame(results)
res_df.index.name = 'distance'
res_df = res_df.reset_index().sort_values(by='n_clusters')
res_df.head(30)
dist_19 = res_df[res_df.n_clusters == 19].distance.values[0]
max_d = dist_19
print(max_d)
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

What about 9?
dist_9 = res_df[res_df.n_clusters == 9].distance.values[0]
max_d = dist_9
print(max_d)
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

There seem to be many non-trivial clusterings. I will keep the three cases (3, 9, and 19 clusters).
clusters = [{'n_components': n} for n in range(2, 30)]
silhouette, aic, best_n = clust.validate_clustering(X_train_t, GaussianMixture, clusters,
clust.gmm_aic, 'n_components')
print('The best case, based on the Silhouette score, is {}'.format(best_n))
pd.DataFrame([clusters, aic, silhouette], index=['params', 'aic', 'silhouette']).T
Let's check the 3 clusters case, and the 16 clusters case.
gmm = GaussianMixture(n_components=3)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

gmm = GaussianMixture(n_components=16)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

Let's grid search for DBSCAN parameters.
# Define DBSCAN parameters
n_batch = 20
m_batch = 5
eps_list = np.linspace(0.1, 2.0, n_batch).repeat(m_batch)
min_samples_list = [5, 20, 50, 100, 300] * n_batch
params = [{'eps': e, 'min_samples': ms} for e, ms in zip(eps_list, min_samples_list)]
# Search and gather the results
silhouette = list()
n_clusters = list()
for i, param_set in enumerate(params):
    tic = time()
    method = DBSCAN(**param_set)
    method.fit(X_train_t)
    labels = method.labels_
    try:
        silhouette.append(silhouette_score(X_train_t, labels))
    except ValueError:
        silhouette.append(0)
    n_clusters.append(len(np.unique(labels)))
    toc = time()
    print('Algorithm {} of {} finished in {} seconds. Params: {} Clusters: {}'.format(
        i + 1, len(params), (toc - tic), param_set, len(np.unique(labels))))
# Show the results
best_silhouette_params = params[np.argmax(silhouette)]
print('The best Silhouette score is for {}, and its value is: {}'.format(
best_silhouette_params, max(silhouette)))
print('The number of clusters for {} is: {}'.format(
best_silhouette_params, n_clusters[np.argmax(silhouette)]))
# 3D plot for the Silhouette score
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in matplotlib 3.6
ax.plot_trisurf(eps_list, min_samples_list, silhouette, linewidth=0.2, antialiased=True)
# 3D plot for the number of clusters
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_trisurf(eps_list, min_samples_list, n_clusters, linewidth=0.2, antialiased=True)
results = pd.DataFrame([eps_list, min_samples_list, silhouette, n_clusters],
index=['epsilon', 'min_samples', 'silhouette_score', 'n_clusters']).T
results.sort_values(by='silhouette_score', ascending=False)
Let's try with the best values then.
# Create clusters
eps = 1.1
min_samples = 300
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

results.sort_values(by=['n_clusters', 'silhouette_score'], ascending=False).head(15)
# Create clusters
eps = 0.2
min_samples = 20
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

It seems to have found a "core" dense area, and a less dense outer zone. Also some small "satellite" clusters...
# Create clusters
eps = 0.3
min_samples = 5
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

# Create clusters
eps = 0.2
min_samples = 50
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

# Create clusters
eps = 0.5
min_samples = 50
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

# Create clusters
eps = 0.5
min_samples = 100
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_
# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
# visualize_3d_clusters(X_s_train, cluster_labels)

This seems to separate a dense core from the periphery... it could be useful.
I will keep the cases with (eps, min_samples) = (0.2, 20) and (eps, min_samples) = (0.5, 100), which seem to produce interesting clusters.
The function to save the dataset with the new features is "create_cluster_feats_3d". Its docstring can be seen below.
print(clust.create_cluster_feats_3d.__doc__)
%time static_cluster3d_dataset, X_train_r, X_test_r, y_train, y_test =\
clust.create_cluster_feats_3d(save=True)
X_train_r.head()
cluster_feats = ['3d_kmeans_3', '3d_ward_3', '3d_ward_9', '3d_ward_19',
'3d_gmm_3', '3d_gmm_16', '3d_dbscan_02_20', '3d_dbscan_05_100']
sns.pairplot(X_train_r[cluster_feats])
The idea of lagged features is to use past values of the target as features for the predictor. As the simulated time period is very short, there are not many possible lagged features, but some can still be calculated.
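The mechanics can be illustrated with a pandas groupby + shift pattern (toy data and illustrative column names; the project's actual logic lives in lag.fill_lagged_success):

```python
import pandas as pd

# Toy event log: one row per offer sent to a customer, times in hours
events = pd.DataFrame({
    'customer': ['a', 'a', 'a', 'b', 'b'],
    'time':     [0, 168, 336, 0, 408],
    'success':  [1, 0, 1, 0, 1],
})

g = events.sort_values('time').groupby('customer')['success']
# Past outcomes only: shift by one so the current offer never sees itself
events['prev_success'] = g.shift(1)
# Running success ratio over all *previous* offers of the same customer
events['past_success_ratio'] = g.transform(lambda s: s.shift(1).expanding().mean())
print(events)
```

The shift(1) before aggregating is what prevents target leakage: each row only sees offers that were sent earlier.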
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster3d.pkl')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.features.lagged as lag
data = pd.read_pickle(STATIC_DATASET_PATH)
print(data.shape)
data.head()
data.columns
Three functions were created to obtain the desired lagged features. Their docstrings can be seen below.
print(lag.fill_one_lagged_success.__doc__)
print(lag.fill_user_lagged_success.__doc__)
print(lag.fill_lagged_success.__doc__)
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'),
orient='records', lines=True)
%time data_lag = lag.fill_lagged_success(data, portfolio)
print(data_lag.columns)
data_lag.filter(regex='.*_ratio').describe()
data_lag.to_pickle(os.path.join(DATA_PROCESSED, 'static_cluster_lagged.pkl'))
The idea is to try to predict the profits that a customer will generate in the 10 days after an offer is sent to them. The 10-day period was selected for two reasons.
First, it is the longest offer "duration" in the data, so choosing a shorter period could put at a serious disadvantage any offers that are "more attractive" to the customer towards the end of their duration (there could be such offers).
Second, the "effect" period of the offers seems to be longer than their duration (as seen from the spending-time chart), and that makes a lot of sense, since some offers wouldn't be reasonable otherwise (BOGO offers, for example, look like they would be unprofitable unless they had a longer-term effect).
One problem when calculating the profit over 10 days is that the production costs are not available. That makes it impossible to obtain real profit figures, so they were approximated as the customer's total spending in the period minus the reward given (if any).
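As a sketch of that approximation (the column names here are illustrative):

```python
import pandas as pd

# Hypothetical per-offer records: money spent in the 10-day window and
# the reward paid out (zero if the offer was not completed)
records = pd.DataFrame({
    'spent_10_days': [25.0, 8.0, 40.0],
    'reward':        [2.0, 0.0, 5.0],
})

# Profit proxy: total spending minus the reward given, since the
# production costs are not available
records['profit_10_days'] = records['spent_10_days'] - records['reward']
print(records['profit_10_days'].tolist())  # [23.0, 8.0, 35.0]
```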
The other problem is the overlap of effects between consecutive offers, which is studied below. In this first approach it was ignored, in the hope that, statistically, it would more or less cancel out. The problem could be addressed in several ways, for example by calculating the "profit per day in the non-overlapping part of the 10-day period". But that would require further hypotheses, such as "the effect of an offer doesn't vary too much over its duration" (so that the mean over a sub-period is a more or less accurate estimate of the total mean). Given the time deadline, that possibility wasn't explored this time.
One final issue that complicates things a bit is that the simulation covers a very short period of time. That means we cannot use a time-based validation scheme with many time splits. A test set was set aside, but no more than that.
One particularity of this problem is that we have to consider the possibility of not sending any offer at all. That case will be called the "null offer" and is dealt with below.
Solution description: The idea is to create two Machine Learning estimators: one to predict the probability that a customer views an offer, and another to predict the profits generated given that the offer was viewed.
The final estimator combines them by multiplying the view probabilities from the first with the expected profits from the second.
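In expectation, that combination is just a product; a toy numeric sketch (the probabilities and profits are made up):

```python
import numpy as np

# First estimator's output: P(offer is viewed), one value per customer
p_view = np.array([0.9, 0.3, 0.6])

# Second estimator's output: expected 10-day profit given a view
profit_if_viewed = np.array([20.0, 35.0, 10.0])

# Final prediction: E[profit] = P(view) * E[profit | view]
expected_profit = p_view * profit_if_viewed
print(expected_profit)
```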
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster_lagged.pkl')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.data.profit_10_days_dataset as p10
# Read the data
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
static_data = pd.read_pickle(STATIC_DATASET_PATH)
# Basic Preprocessing
print('Basic preprocessing')
%time data, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)
# Split
received, viewed, completed, transactions = pp.split_transcript(data)
We have to differentiate the case in which the customer has viewed an offer in the last 10 days from the case in which they haven't.
static_data.viewed.mean()
So, about 75% of the sent offers are viewed. The remaining 25% will be used as samples of the "null offer" (more samples could be obtained by looking at all the periods in which there are no viewed offers, but that would complicate things and doesn't seem necessary).
# Create empty time series
ts = pd.DataFrame(np.zeros((data.time.nunique(), data.person.nunique())),
                  index=data.time.unique(), columns=data.person.unique())
# Add the transactions data
ts.update(transactions.pivot_table(index='time',
                                   columns='person',
                                   values='amount',
                                   aggfunc='sum'))
sending_times = received.time.unique()
for sent in sending_times:
    plt.vlines(sent, ts.mean(axis=1).min(), ts.mean(axis=1).max(), colors='r')
    plt.text(sent, ts.mean(axis=1).max()*0.8, 'sent!', dict(color='r', size=16))
ts.mean(axis=1).plot()
print('The offers were sent at these times: {}'.format(sending_times))
How big is the difference in time between consecutive offers?
sending_times[1:] - sending_times[:-1]
It can be seen that the effect of an offer lasts for at least about 170 hours, that is, about 7 days, but possibly more. Let's take a 10-day period as the "relevant" period for all offers; if we took a different period for each offer we would be comparing them unfairly.
A function was created to show the "relevant" periods of a customer's offers. It can show either all the offers or only the viewed ones.
# Get the data for one user alone
user = static_data[static_data.person == static_data.person[1]]
interesting_offers = user.offer_id.tolist() + ['no_offer']
offer_times = p10.get_offers_ts(user, portfolio, data, viewed=False)
offer_times.loc[:, interesting_offers].plot()
plt.title('All the offers\' relevant periods for one user')
p10.get_offers_ts(user, portfolio, data, viewed=True).loc[:, interesting_offers].plot()
plt.title('Only the viewed offers\' relevant periods for one user')
It is very clear that the overlapping exists and may be relevant (less so if we consider only the viewed offers). The assumption will be that an offer overlaps randomly with all the others, so that, statistically, the effects more or less cancel out.
If the offers behave, on average, similarly throughout their full period of interest, a model could be implemented on a day-to-day basis, and the overlapping zones could be ignored or handled in an easier way, but that won't be explored here.
A function was implemented to add, for each offer reception, the amount spent by the customer in the following 10 days. Running it on the full dataset takes about 25 minutes; if the dataset was already created, it is simply loaded (much faster).
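The core of that computation can be sketched as follows (time is in hours, so 10 days = 240 hours; the column names follow the transcript data, but the helper name is hypothetical):

```python
import pandas as pd

def spent_next_10_days(transactions, person, t0, horizon=240):
    """Sum a customer's spending in the (t0, t0 + horizon] window."""
    mask = ((transactions['person'] == person)
            & (transactions['time'] > t0)
            & (transactions['time'] <= t0 + horizon))
    return transactions.loc[mask, 'amount'].sum()

# Toy transactions for one customer (time in hours, amount in dollars)
tx = pd.DataFrame({'person': ['a'] * 3,
                   'time': [10, 100, 300],
                   'amount': [5.0, 7.5, 4.0]})

print(spent_next_10_days(tx, 'a', 0))  # 12.5: the purchase at 300 h is outside
```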
# The function below generates the profit 10 days static dataset
STATIC_SPENT_10_DAYS = os.path.join(DATA_PROCESSED, 'static_spent_10_days.pkl')
if os.path.exists(STATIC_SPENT_10_DAYS):
    filled = pd.read_pickle(STATIC_SPENT_10_DAYS)
else:
    filled = p10.get_spent_days_static(static_data, data)
    filled.to_pickle(STATIC_SPENT_10_DAYS)
print(filled.shape)
filled.head()
How is the spending distributed among the offers?
filled.groupby('offer_id').spent_10_days.sum().sort_values().plot(kind='barh')
And what's the mean spending per offer id?
filled.groupby('offer_id').spent_10_days.mean().sort_values().plot(kind='barh')
What about the mean spending for each kind of offer?
filled.groupby('offer_type').spent_10_days.mean().sort_values().plot(kind='barh')
Interesting: it looks like the "discount" and "bogo" offers tend to produce a bit more income.
After collecting the information about the money spent by customers, there is still work to do before we have a usable dataset.
In particular, the "null offer" must be added, and that makes the dataset a bit tricky. The "null offer" should be present when predicting the profits for a viewed offer, but not when predicting the probability of a view happening (because the null offer is filled in where offers were not viewed, and that is exactly what we want to determine in that case). To address this, PROFIT_COLS and VIEW_COLS were created, containing differentiated offer information. The VIEW_COLS are a copy of the offer-information columns, while the PROFIT_COLS are also a copy but sometimes contain the information of the null offer.
Below are the docstrings of "fill_null_offer" and "get_profit_10_days_data". The docstring of "split_view_profit" is also included; it takes one dataset (e.g., X_train) and divides it into the parts needed for the "views prediction" and for the "viewed-offer profits prediction".
Finally, an extension of the BasicEncoder was created that accepts new custom features to label-encode (it was needed for the "offer_id" feature, and some others that were ultimately not used).
print(p10.fill_null_offer.__doc__)
print(p10.get_profit_10_days_data.__doc__)
print(p10.BasicEncoderProfits.__doc__)
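A minimal sketch of that column duplication (the prefixes and offer columns here are illustrative; the actual logic lives in `fill_null_offer`):

```python
import pandas as pd

offer_cols = ['offer_type', 'difficulty', 'reward']
df = pd.DataFrame({'viewed':     [1, 0, 1],
                   'offer_type': ['bogo', 'discount', 'info'],
                   'difficulty': [10, 7, 0],
                   'reward':     [10, 3, 0]})

# VIEW_COLS: always a plain copy of the offer that was actually sent
for c in offer_cols:
    df['view_' + c] = df[c]

# PROFIT_COLS: same copy, but unviewed offers become the "null offer"
null_offer = {'offer_type': 'no_offer', 'difficulty': 0, 'reward': 0}
for c in offer_cols:
    df['profit_' + c] = df[c].where(df['viewed'] == 1, null_offer[c])
print(df.filter(regex='profit_'))
```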
Below, the datasets are generated and shown. Note that "y" has 2 columns, one for each estimator.
# Get the data
X_train, X_test, y_train, y_test, encoder, view_cols, profit_cols =\
    p10.get_profit_10_days_data(fill_null=True,
                                target=['viewed', 'profit_10_days'],
                                drop_offer_id=False)
print(X_train.shape)
print(y_train.shape)
X_train.head()
y_train.head(15)
print('View columns: \n{}\n'.format(view_cols))
print('Profit columns: \n{}'.format(profit_cols))
Then a profits predictor was created, containing two sklearn Pipelines: one to predict the view probabilities and another to predict the profits given a viewed offer. Both are combined in the predictor to give the final result. The ProfitsPredictor is a BaseEstimator and RegressorMixin, and follows scikit-learn conventions.
A complete training and evaluation example can be seen in the "Experiments and Results" section.
print(p10.ProfitsPredictor.__doc__)
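As a rough skeleton of how such a combined estimator follows scikit-learn conventions (a simplified sketch with stand-in linear models, not the project's actual implementation, which uses XGBoost pipelines):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LogisticRegression, LinearRegression

class ProfitsPredictorSketch(BaseEstimator, RegressorMixin):
    """Combine a view-probability classifier with a profit regressor."""

    def fit(self, X, y):
        # y[:, 0]: viewed flag; y[:, 1]: profit in the 10-day window
        self.views_model_ = LogisticRegression().fit(X, y[:, 0])
        seen = y[:, 0] == 1
        # The profit regressor is trained only on viewed offers
        self.profits_model_ = LinearRegression().fit(X[seen], y[seen, 1])
        return self

    def predict(self, X):
        p_view = self.views_model_.predict_proba(X)[:, 1]
        profit_if_viewed = self.profits_model_.predict(X)
        return p_view * profit_if_viewed

# Toy usage with random data
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = np.column_stack([rng.randint(0, 2, 50), rng.rand(50) * 20])
pred = ProfitsPredictorSketch().fit(X, y).predict(X)
print(pred.shape)  # (50,)
```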
After having a predictor for the expected profits in 10 days, some functions are needed to decide which offer to send to each customer.
To do that, "predict_profit_with_offer" can be used to get predictions for every sample in the dataset, but with the "offer features" replaced by those of a particular offer.
Then, "choose_offer" loops over all possible offers in the portfolio, including an added "null offer", and picks the best one for each customer in each case (note that the same customer may get different offers at different moments in time because, even though the "absolute time features" were dropped [assuming stationarity], the "lagged" features change with the customer's history).
Those functions will be seen in action when an AB test "quasi-experiment" is run in the "Experiments and Results" section.
The docstrings are below.
print(p10.predict_profit_with_offer.__doc__)
print(p10.choose_offer.__doc__)
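The selection logic amounts to an argmax over the candidate offers; a sketch with a hypothetical `predict_for_offer` helper:

```python
import numpy as np
import pandas as pd

def choose_offer_sketch(predict_for_offer, X, offer_ids):
    """Pick, per customer row, the offer with the highest predicted profit.

    predict_for_offer(X, offer_id) -> array of predicted profits, one per row.
    """
    # One column of predictions per candidate offer (including 'no_offer')
    preds = pd.DataFrame({oid: predict_for_offer(X, oid) for oid in offer_ids},
                         index=X.index)
    return preds.idxmax(axis=1), preds

# Toy check with a fake predictor that prefers 'B' for the second row
fake = {'A': np.array([3.0, 1.0]), 'B': np.array([2.0, 5.0]),
        'no_offer': np.array([0.5, 0.5])}
best, preds = choose_offer_sketch(lambda X, oid: fake[oid],
                                  pd.DataFrame(index=[0, 1]),
                                  ['A', 'B', 'no_offer'])
print(best.tolist())  # ['A', 'B']
```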
Many experiments were run, with different features and hyperparameters. The most successful one is reproduced below. The "Grid Search" that led to this result took about 37 minutes, which is why it is omitted here, but a boolean (grid_search) can be set to run it fully.
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
# Get the data
STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster_lagged.pkl')
X_train_val, X_test, y_train_val, y_test, encoder = sd.get_success_data(
    basic_dataset_path=STATIC_DATASET_PATH,
    drop_time=False)
# Time-split validation datasets
X_test = pp.drop_time_dependent(X_test)
X_train, X_val, y_train, y_val = sd.time_split(X_train_val,
                                               y_train_val,
                                               time_limit=370)
print(X_train.shape)
print(y_train.shape)
X_train.head()
print(X_val.shape)
print(y_val.shape)
X_val.head()
print(X_test.shape)
print(y_test.shape)
X_test.head()
cluster1_feats = ['kmeans_8', 'ward_12', 'dbscan_10']
cluster3d_feats = ['3d_kmeans_3', '3d_ward_3', '3d_ward_9', '3d_ward_19',
                   '3d_gmm_3', '3d_gmm_16', '3d_dbscan_02_20', '3d_dbscan_05_100']
base_model = Pipeline([
    ('encoder', pp.BasicEncoder()),
    ('imputer', md.BasicImputer(fill_mode=cluster1_feats + cluster3d_feats)),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=200, n_jobs=-1,
                                random_state=2018))
])
# Grid search for better parameters (to run it, set "grid_search" to True;
# it can take around 40 minutes).
grid_search = False
if grid_search:
    parameters = {
        'estimator__max_depth': [4, 7],
        'estimator__n_estimators': [10, 200, 500],
        'estimator__subsample': [0.5, 1.0],
        'estimator__colsample_bytree': [0.5, 0.7, 1.0],
        'estimator__colsample_bylevel': [0.5, 0.7, 1.0]
    }
    cv = GridSearchCV(base_model, parameters, cv=3, n_jobs=-1)
    %time cv.fit(X_train, y_train)
    print('The best parameters are:')
    print(cv.best_params_)
    print('-'*100)
    model = cv.best_estimator_
    model.get_params()
else:
    model = Pipeline([
        ('encoder', pp.BasicEncoder()),
        ('imputer', md.BasicImputer(fill_mode=cluster1_feats + cluster3d_feats)),
        ('estimator', XGBClassifier(max_depth=4, n_estimators=200,
                                    subsample=1.0, colsample_bytree=0.5,
                                    colsample_bylevel=1.0, random_state=2018))
    ])
trained_model, y_train_pred, y_val_pred = evos.time_split_validation(
    model, basic_dataset_path=STATIC_DATASET_PATH)
evos.random_1fold_cust_validation(model, basic_dataset_path=STATIC_DATASET_PATH)
vis.show_feat_importances(model, X_train)
The cluster features are taken into account, but they are not the most relevant ones and don't seem to change the results much. When the lagged features are added, there is a small improvement in the F1-score.
evos.offer_success_test(model, basic_dataset_path=STATIC_DATASET_PATH)
%reset -f
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import mean_squared_error as mse
import scipy.stats as stats
ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')
STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_spent_10_days.pkl')
import sys
sys.path.append(SRC)
import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.data.profit_10_days_dataset as p10
import src.visualization.visualize as vis
# Get the data
X_train, X_test, y_train, y_test, encoder, view_cols, profit_cols =\
    p10.get_profit_10_days_data(fill_null=True,
                                target=['viewed', 'profit_10_days'],
                                drop_offer_id=False)
print(X_train.shape)
print(y_train.shape)
X_train.head()
print(X_test.shape)
print(y_test.shape)
X_test.head()
model = p10.ProfitsPredictor(encoder=encoder, view_cols=view_cols, profit_cols=profit_cols)
Due to the small number of time instances, there will be no validation set or grid search. Instead, a single model is fitted and tested (a test set is set aside).
%time model.fit(X_train, y_train)
n_feats = 20
plot_importance(model.views_model.named_steps['estimator'], max_num_features=n_feats)
plot_importance(model.profits_model.named_steps['estimator'], max_num_features=n_feats)
y_train_pred = model.predict(X_train)
print('Training error (RMSE) = {}'.format(np.sqrt(mse(y_train['profit_10_days'], y_train_pred))))
y_test_pred = model.predict(X_test)
print('Test error (RMSE) = {}'.format(np.sqrt(mse(y_test['profit_10_days'], y_test_pred))))
n_bins = 50
plt.subplot(1,2,1)
_ = plt.hist(y_test_pred, bins=n_bins)
plt.title('Predicted test profits')
plt.subplot(1,2,2)
y_test['profit_10_days'].hist(bins=n_bins)
_ = plt.title('Actual test profits')
_ = sns.kdeplot(y_test_pred, shade=True, color="r", label='Predicted test profits')
_ = sns.kdeplot(y_test['profit_10_days'], shade=True, color="b", label='Actual test profits')
_ = plt.xlim((-50, 250))
_ = plt.title('Estimated densities for the predicted and real profits in 10 days')
Ideally, to further test the model, it would be good to perform an AB test. In a perfect situation we would collect data from reality. In a less idealized world we would have access to a simulator to which we could input offers. In the actual case, the only thing available is data generated in the past by a simulator, with no possibility of generating new data to our desired specifications. Still, it is possible to perform what I call a "pseudo AB test".
The idea is to predict the best offers for each customer, then look for cases in the test set where, by chance, exactly those offers were sent. That will be the sample "with the new predictor". A random sample from the test set (of the same size as the one obtained before) will be the sample "without the predictor" (the samples may overlap, but that is not important; what matters is that the second sample comes from the "old offer-sending procedure" distribution).
That method is far from ideal. In particular, the sample size is determined by chance and cannot be controlled. Despite that, its significance can be assessed.
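The sampling scheme can be sketched like this (boolean matching for the treatment group, plus an equally sized random control group; the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

def pseudo_ab_samples(y_test, sent_offers, chosen_offers, seed=2018):
    """Treatment: test rows where the sent offer equals the model's choice.
    Control: a random sample of the same size from the whole test set."""
    match = (sent_offers.values == chosen_offers.values)
    treatment = y_test[match]
    rng = np.random.RandomState(seed)
    control_idx = rng.choice(y_test.index, match.sum(), replace=False)
    control = y_test.loc[control_idx]
    return treatment, control

# Toy data: profits, offers actually sent, and model-chosen offers
y = pd.DataFrame({'profit_10_days': [10., 20., 5., 15.]})
sent = pd.Series(['A', 'B', 'A', 'C'])
chosen = pd.Series(['A', 'A', 'A', 'C'])
treat, ctrl = pseudo_ab_samples(y, sent, chosen)
print(len(treat), len(ctrl))  # 3 3
```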
# Read the data
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
# Basic Preprocessing
print('Basic preprocessing')
%time _, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)
portfolio
selected_offers, predicted_full = p10.choose_offer(model, X_test, portfolio)
selected_offers.head()
num_samples = (X_test.offer_id == selected_offers.values).sum()
print('The old procedure sent the model\'s preferred offer in {:.1f}% of the cases'.format(
    100*(X_test.offer_id == selected_offers.values).mean()))
print('There are {} samples of the selected offers of the new model.'.format(num_samples))
# Old model
np.random.seed(2018)
old_res = y_test.loc[np.random.choice(y_test.index, num_samples, replace=False), :]
old_views = old_res['viewed']
old_profits = old_res['profit_10_days']
print('\n' + '-'*100)
print('Total views for the old model: {} ({:.1f}%)'.format(old_views.sum(), 100*old_views.sum()/num_samples))
print('Total profit for the old model: {}'.format(old_profits.sum()))
print('Mean and std for the old model: mean = {}, std = {}'.format(old_profits.mean(), old_profits.std()))
# New model
new_res = y_test[(X_test.offer_id == selected_offers.values)]
new_views = new_res['viewed']
new_profits = new_res['profit_10_days']
print('-'*100)
print('Total views for the new model: {} ({:.1f}%)'.format(new_views.sum(), 100*new_views.sum()/num_samples))
print('Total profit for the new model: {}'.format(new_profits.sum()))
print('Mean and std for the new model: mean = {}, std = {}'.format(new_profits.mean(), new_profits.std()))
print('-'*100)
pooled_profits = np.hstack([old_profits.values, new_profits.values])
mu_pooled = pooled_profits.mean()
sigma_pooled = pooled_profits.std()
mu_old = old_profits.mean()
mu_new = new_profits.mean()
# Note: this uses the pooled std directly as the denominator, a rough
# simplification of the two-sample z-statistic (which would divide by
# sigma_pooled * sqrt(1/n_old + 1/n_new))
z = (mu_new - mu_old) / sigma_pooled
print('z = {}'.format(z))
print('The p-value is: {}'.format(1 - stats.norm.cdf(z)))
So, at a significance level of 0.05, it is not possible to reject the null hypothesis. In any case, that may only mean that we don't have enough data to decide. A properly designed experiment, with access to real or simulated data collection, could reject the null hypothesis (one experiment, no more than that [to avoid the mistake of "trying until a good sample comes up"], designed from the beginning with the desired levels of significance and statistical power).
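For reference, the per-group sample size such an experiment would need can be estimated in advance from the desired significance level and power; a standard formula for a one-sided two-sample z-test (the effect size and std plugged in below are assumptions, not measured values):

```python
import math
from scipy import stats

def required_n(delta, sigma, alpha=0.05, power=0.8):
    """Per-group n for a one-sided two-sample z-test detecting a mean
    difference `delta` between groups with common std `sigma`."""
    z_alpha = stats.norm.ppf(1 - alpha)  # critical value for significance
    z_beta = stats.norm.ppf(power)       # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# E.g., to detect a 5-dollar lift in mean 10-day profit with std 30
print(required_n(delta=5, sigma=30))
```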
_ = sns.kdeplot(old_profits, shade=True, color="r", label='old')
_ = sns.kdeplot(new_profits, shade=True, color="b", label='new')
_ = plt.xlim((-50, 250))
_ = plt.title('Estimated densities for the old and new profits in 10 days')